Can Money Buy Happiness?

Author

Justin Koida, Abigayle Mercer, Sammy Paykel

Published

February 26, 2025

1.0 // Introduction

Data description:

Country: Country

Year: Year

Poverty: The percentage of the population in poverty (0-100). Poverty here is defined as living off of less than $3.65 per day. This is not adjusted for inflation.

Happiness Score: Happiness as a percent indicated by WHR. The World Happiness Report (WHR) is scored from the national average response to the questions of life evaluations.

Observational Unit: Country and Year

Hypothesis:

Happiness and Poverty levels are linearly correlated. So we hypothesize that money can buy happiness.

Data cleaning process and decision:

We chose to drop all observations with na values, we may go back later and choose to do some data imputation to keep more data. However, we found ththat dropping na resulted in a loss of ~800 data points from the original 3200.

1.1 // Data

Code
library(tidyverse)
library(DT)
library(plotly)
library(gridExtra)

pop <- read_csv("data/pop.csv")
hap <- read_csv("data/hapiscore_whr.csv")
pov <- read_csv("data/gm_365pov_rate.csv")

1.2, 1.3, 1.4, 1.5 // Data Cleaning and Joining

Code
clean_hap <- hap |>
  pivot_longer(cols = `2005`:`2023`,
               names_to = "Year",
               values_to = "Happiness") |>
  rename(Country = country) |>
  drop_na()

clean_pov <- pov |>
  select(`2005`:`2023`, country) |>
  pivot_longer(cols = `2005`:`2023`,
               names_to = "Year",
               values_to = "Poverty") |>
  rename(Country = country) |>
  drop_na()

joined_pov_hap <- clean_hap |>
  inner_join(clean_pov, by = join_by(Country == Country, Year == Year))

datatable(joined_pov_hap, 
          caption = htmltools::tags$caption(
            style = 'caption-side: top; text-align: left; font-size: 24px;',
            "Poverty and Happiness Data, Joined by Country and Year"
          ), 
          options = list(pageLength = 10)
          )

2.0 // Linear Regression

Displays the natural log of the ratio of the Happiness score and Poverty given year. Taking the natural log minimizes the impact for the skewness of the data, and accounts for severe outliers. The general trend of the ratio shows that either happiness increases over time or that poverty decreases:

Code
# Create a new variable: Happiness/Poverty ratio
joined_pov_hap <- joined_pov_hap |>
  mutate(Hap_Pov_Ratio = Happiness / Poverty) |>
  mutate(Log_Hap_Pov_Ratio = log(Hap_Pov_Ratio))

# Plot boxplots of the ratio over time
ggplot(joined_pov_hap, aes(x = as.factor(Year), y = Log_Hap_Pov_Ratio)) +
  #scale_y_continuous(limits = quantile(joined_pov_hap$Log_Hap_Pov_Ratio, c(.1, .9))) +
  geom_boxplot(fill = "lightblue", alpha = 0.6, outliers = FALSE, outliers.shape = NA) +  # Boxplots without outliers
  #geom_jitter(width = 0.2, alpha = 0.3, color = "blue") +  # Jitter for individual points
  
  labs(title = "Distribution of the Natural Log of Happiness/Poverty Ratio Over Time",
       x = "Year",
       y = "Ln(Happiness / Poverty Ratio)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  

Displays the relationship between the poverty rate and happiness score averaged over the years 2005 to 2023 where each point represents a different country. The general trend of points indicates that there is a negative relationship between poverty rate and happiness score; countries with higher poverty rates have lower happiness scores:

Code
pov_hap_avg <- joined_pov_hap |>
  group_by(Country) |>
  summarize(Avg_Happiness = mean(Happiness, na.rm = TRUE),
            Avg_Poverty = mean(Poverty, na.rm = TRUE), .groups = "drop")

# Scatterplot with one observation per country
scatter_plot <- ggplot(pov_hap_avg, aes(x = Avg_Poverty, y = Avg_Happiness)) +
  geom_point(alpha = 0.6, color = "darkblue") +
  # geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(title = "Averaged Relationship Between Happiness and Poverty",
       x = "Average Poverty Rate (%)",
       y = "Average Happiness Score (%)") +
  theme_minimal()

ggplotly(scatter_plot)
2.2 // Linear Regression

Linear Regression is a supervised machine learning technique to learn the relationship between a dependent variable and one or more independent variables. Linear Regression assumes that there is some sort of linear relationship between the two variables. Linear Regression attempts to calculate the best Fit line, where the slope indicates the strength of the relationship between the two variables.

source: https://www.geeksforgeeks.org/ml-linear-regression/

Code
model <- lm(Happiness ~ Poverty, data = joined_pov_hap) #https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/predict.lm
summary(model)

Call:
lm(formula = Happiness ~ Poverty, data = joined_pov_hap)

Residuals:
    Min      1Q  Median      3Q     Max 
-41.225  -4.914   0.145   5.281  20.105 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 63.712958   0.227018  280.65   <2e-16 ***
Poverty     -0.239021   0.004437  -53.87   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.543 on 2340 degrees of freedom
Multiple R-squared:  0.5536,    Adjusted R-squared:  0.5534 
F-statistic:  2902 on 1 and 2340 DF,  p-value: < 2.2e-16
2.3 // Model Fit
Code
# print(tibble(predict(model)))

pre_variance = joined_pov_hap |>
  select(Happiness) |>
  mutate(predicted_happiness = predict(model)) |> # https://www.math.ucla.edu/~anderson/rw1001/library/base/html/predict.lm.html#:~:text=lm%20produces%20predicted%20values%2C%20obtained,of%20the%20predictions%20are%20calculated.
  mutate(residual = Happiness - predicted_happiness)

variances = pre_variance |> # https://www.marsja.se/variance-in-r-how-to-find-calculate/
  mutate(happiness_variance = var(Happiness), 
         predicted_variance = var(predicted_happiness),
         residual_variance = var(residual)) |>
  select(happiness_variance, predicted_variance, residual_variance) |>
  head(1)

knitr::kable(variances)
happiness_variance predicted_variance residual_variance
127.3947 70.5234 56.87134
Model Fit Analysis

The linear regression model predicting Happiness from Poverty produced an ( R^2 ) value of 0.5536, meaning it explains ~55.4% of the variation in happiness.

Interpretation:
  • The negative coefficient for Poverty (-0.239) suggests a moderate inverse relationship: as poverty increases, happiness tends to decrease.
  • The model is statistically significant (( p < 2.2 * 10^{-16} )).
  • However, about 45% of the variance remains unexplained, indicating other factors (e.g., GDP, social policies) likely influence happiness.

3.0 // Simulation

3.1 // Visualizing Simulated Data
Code
set.seed(123)

predicted_values <- predict(model)

simulated_happiness <- predicted_values + rnorm(length(predicted_values), mean = 0, sd = sigma(model))

simulated_data <- joined_pov_hap |>
  mutate(Simulated_Happiness = simulated_happiness)

p1 <- ggplot(joined_pov_hap, aes(x = Poverty, y = Happiness)) +
  geom_point(alpha = 0.6, color = "steelblue") +
  labs(title = "Observed Happiness vs. Poverty", x = "Poverty", y = "Happiness") +
  theme_minimal()

p2 <- ggplot(simulated_data, aes(x = Poverty, y = Simulated_Happiness)) +
  geom_point(alpha = 0.6, color = "orange") +
  labs(title = "Simulated Happiness vs. Poverty", x = "Poverty", y = "Simulated Happiness") +
  theme_minimal()

interactive_plot1 <- ggplotly(p1)
interactive_plot2 <- ggplotly(p2)

subplot(interactive_plot1, interactive_plot2, nrows = 1, margin = 0.025)
Code
library(gridExtra)
#grid.arrange(p1, p2, ncol = 2)
3.2 // Multiple Predictive Checks
Code
set.seed(321)

num_simulations <- 1000
r_squared_values <- numeric(num_simulations)

for (i in 1:num_simulations) {
  simulated_y <- predicted_values + rnorm(length(predicted_values), mean = 0, sd = sigma(model))
  temp_model <- lm(simulated_y ~ joined_pov_hap$Poverty)
  r_squared_values[i] <- summary(temp_model)$r.squared
}

r_squared_df <- tibble(R_Squared = r_squared_values)

r_squared_hist <- ggplot(r_squared_df, aes(x = R_Squared)) +
  geom_histogram(binwidth = 0.02, fill = "purple", alpha = 0.7, color = "black") +
  labs(title = "Distribution of R-Squared from Simulated Regressions",
       x = "R-Squared", y = "Frequency") +
  theme_minimal()

ggplotly(r_squared_hist)
Analysis of Simulated Predictive Checks
  • The distribution of simulated R^2 values is similar to the observed model R^2, suggesting our model is a reasonable fit for the data.